Abstract: This contribution gives an overview of face recognition
algorithms, their implementation and practical uses. First, a
training set of different persons’ faces has to be collected and
used to train a face recognizer. The resulting face model can be
used to classify people as specific individuals or as unknown.
After tracking the recognized face and estimating the position of
the acoustic sound source, both can be combined to give detailed
information about possible speakers and whether they are talking or
not. This leads to a precise real-time description of the situation,
which can be used for further applications, e.g. for multi-channel
speech enhancement by adaptive beamformers.
I. Introduction

In recent times, the interest in and research on acoustic 
source localization and enhancement of certain sound sources 
has increased dramatically due to the growing desire for hands- 
free interaction with various devices [18]. Combining the abil- 
ity to locate sound sources and to recognize possible speakers 
with a camera potentially enables machines to identify speakers.
This makes human-machine interaction considerably easier, more
adaptive and more reliable. Comparable systems can be used for
teleconferencing, smart rooms or ambient assisted living [5, 8].

Acoustic cameras combine a microphone array’s ability to locate
sound sources with the intuitive way information can be extracted
from a webcam image, and have become quite popular in many
industrial segments. They compute a color-coded sound map and thus
visualize the sound pressure levels within a user-defined field of
view. This way, acoustic cameras can locate sound sources quite
accurately, which is why they are often used to identify unwanted
noise sources [12, 31].

Face recognition is a machine learning technique that ideally
allows detecting and identifying all faces seen in a picture or a
video frame. It can be used for criminal detection, image
processing, human-computer interaction, etc. [33, 34]. In the early
development of face recognition systems, geometric facial features,
e.g. eyes, nose and mouth, were used explicitly. Properties of
these features and relations between them (e.g. positions,
distances, angles) were used as descriptors for face recognition
[15]. Today, holistic techniques, e.g. principal component analysis
(see Eigenfaces) or linear discriminant analysis (see Fisherfaces),
are used to identify individuals [3, 13].

II. Basics of an Acoustic Camera 

The implementation of an acoustic camera requires a suita- 
ble microphone array as well as beamforming algorithms to 
locate sound sources precisely. Given both, a color coded 
sound map of the measured sound pressure level can be com- 
puted and displayed as seen in Figure 3 [25, 26]. 

A. A suitable microphone array 

In [26] an appropriate microphone array (see Figure 1) for
speaker and sound source localization has been developed and 
verified. It can be shown that double ring arrays with an odd 
number of microphones on each ring are desirable for locating 
speech sources [20]. In this project, the inner ring has a diame- 
ter of 0.2 m, while the outer ring is twice as large. An im- 
portant part of acoustic cameras is a sound analysis and visual- 
ization software [31]. This software can for example be in- 
stalled on a personal computer. The connection of microphones 
with any computer is achieved by using a microphone amplifi- 
er and a multi-channel sound card. In particular, two RME 
OCTAMIC XTC amplifiers and a RME MADIface USB are 
used. Both amplifiers digitize analog signals of up to eight 
channels completely synchronously. By interconnecting two 
RME OCTAMICs in series, up to 16 analog signals can be 
converted to digital values. Thus, in order to read out all signals 
synchronously, it is necessary to activate a delay compensation 
in each amplifier [27].

Similar to temporal undersampling, which causes temporal 
aliasing effects, spatial undersampling can lead to spatial alias- 
ing. This effect can be observed in acoustic cameras’ color 
maps as incorrectly detected sound sources [30]. There are several
approaches to minimizing spatial aliasing. In [20] it is shown that
an odd number of microphones on each ring of the array reduces
redundancy, which results in more robustness against spatial
aliasing. Ring arrays in general decrease the redundancy of
microphone arrays, because at a certain frequency only a few
microphone pairs are affected by spatial aliasing, while others are
not yet affected [9]. For building a sensor array, omnidirectional
microphones, such as the selected AKG CK-92 condenser microphones,
have proven advantageous [8, 25, 26].



Figure 1: Developed double ring array

Using more microphones and distributing them randomly on
a plane are two possible improvements to an acoustic camera’s
microphone array.

B. Steered Response Power Beamforming Algorithm 

Beamforming algorithms process signals such that signals from a
desired direction are enhanced, while signals from all other
directions are attenuated. The chosen direction is called the
steering direction; by varying it, a defined plane can be spatially
sampled. The beamformer’s output, when used in this way, is known as
the steered response [12]. As seen in Figure 2, the Steered Response
Power algorithm with Phase Transform weighting (SRP-PHAT) calculates
a color map by summing certain values of the signal pairs’
generalized (weighted) cross correlation (GCC). Since the cross
correlation of two signals has the dimension of a power, the steered
response is a power quantity. This is why the described method is
known as a steered response power beamformer [11, 12].



Figure 2: Schematic diagram of the SRP-PHAT algorithm, 
according to [17] 


The mentioned GCC is similar to a regular cross correlation, with
the only difference that weighted input signals are used. To obtain
the PHAT weighting, the Fourier-transformed signal $X_1(\omega)$ and
the complex conjugate of $X_2(\omega)$ are used as seen in (1). TDOA
stands for time difference of arrival, i.e. the difference between
the arrival times of a signal at two sensors. The TDOA of a single
microphone pair differs with the steering direction [11, 25].

$$\psi_{\mathrm{PHAT}}(\omega) = \frac{1}{\left| X_1(\omega)\, X_2^{*}(\omega) \right|} \qquad (1)$$
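As an illustration of the PHAT-weighted cross correlation, the following minimal NumPy sketch estimates the TDOA of a single microphone pair. The function name `gcc_phat`, the small regularization constant and the sampling rate of 32 kHz in the usage lines are assumptions made for this sketch; they are not taken from the implementation described in this contribution.

```python
import numpy as np


def gcc_phat(x1, x2, fs, max_tau=None):
    """Estimate the TDOA (in seconds) between x1 and x2 via GCC-PHAT.

    A positive result means that x1 arrives later than x2.
    """
    n = len(x1) + len(x2)                      # zero-pad against circular wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)                   # cross spectrum X1 * conj(X2)
    # PHAT weighting: divide by the magnitude of the cross spectrum.
    # The constant 1e-12 only guards against division by zero (assumption).
    cross /= np.maximum(np.abs(cross), 1e-12)
    cc = np.fft.irfft(cross, n=n)              # weighted cross correlation

    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Rotate so that index max_shift corresponds to zero lag.
    cc = np.roll(cc, max_shift)[: 2 * max_shift + 1]
    lag = int(np.argmax(np.abs(cc))) - max_shift
    return lag / fs


# Usage with synthetic data: x1 is x2 delayed by 5 samples.
fs = 32000                                     # hypothetical sampling rate in Hz
rng = np.random.default_rng(0)
x2 = rng.standard_normal(1024)
x1 = np.concatenate((np.zeros(5), x2[:-5]))
tau = gcc_phat(x1, x2, fs)
```

In an SRP-PHAT beamformer, such correlation values would be evaluated at the TDOAs belonging to each steering direction and summed over all microphone pairs.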

In [25] it has been shown that the best sound source localization
results can be achieved by combining the SRP-PHAT algorithm with a
constantly weighted SRP beamformer. This is because SRP-PHAT is
unable to process narrowband signals, while being very robust
against reverberation and sensor self-noise. The SRP method is not
as robust as the SRP-PHAT algorithm, but in contrast it is able to
locate narrowband signals such as sine waves or spoken vowels. A
combination of both is implemented by applying a threshold to the
signal’s bandwidth. In this application, the bandwidth threshold is
set to 4 kHz, which is approximately an eighth of the chosen
sampling rate. A typical output image of an acoustic camera can be
seen in Figure 3.



Figure 3: Resulting output image of an acoustic camera 
III. Face Detection 

An ideal face detection system should be able to detect all
faces shown in a picture or a video frame. For this task, neither
the position or orientation of the faces nor the age, sex or ethnic
origin of the people should matter. Furthermore, an ideal face
detection system should be insensitive to lighting changes and
other external influences [16].

OpenCV implements a face library that provides pre-trained face
detectors as well as the possibility to train custom classifiers
[2, 22]. Pre-trained classifiers for Haar-like and local binary
pattern features support frontal face, facial landmark and whole
person detection. If a self-trained classifier is to be used,
several thousand pictures of faces and non-faces should be
collected. A good training set covers faces that differ in age, sex,
ethnic origin, facial hair, lighting and hairstyle [6, 13, 21].
Because of the complexity of the training process, only pre-trained
classifiers for frontal faces are utilized in this contribution.

With the face detection algorithms described below, which use
Haar-like or local binary pattern features, a single face in the
image often yields multiple detections. If these detections are
located close to each other within a specific area, they are
averaged in size and position and merged into one detection result.
This avoids multiple detections for a single person and reduces the
false positive rate [2, 28].
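The merging step can be sketched in a few lines of plain Python. This is an illustrative stand-in and not OpenCV’s internal grouping logic: the function name `merge_detections`, the greedy grouping by center distance and the threshold of 20 pixels are assumptions made for this example.

```python
def merge_detections(rects, max_center_dist=20.0):
    """Group (x, y, w, h) rectangles with nearby centers and average each group."""
    groups = []
    for rect in rects:
        x, y, w, h = rect
        cx, cy = x + w / 2.0, y + h / 2.0
        for group in groups:
            gx, gy, gw, gh = group[0]          # compare with the group's first member
            gcx, gcy = gx + gw / 2.0, gy + gh / 2.0
            if abs(cx - gcx) <= max_center_dist and abs(cy - gcy) <= max_center_dist:
                group.append(rect)
                break
        else:
            groups.append([rect])              # no nearby group found, start a new one

    merged = []
    for group in groups:
        n = len(group)
        # Average position and size over all detections in the group.
        merged.append(tuple(sum(r[i] for r in group) / n for i in range(4)))
    return merged


# Two overlapping detections of one face and one separate detection.
detections = [(100, 100, 50, 50), (104, 98, 52, 52), (300, 120, 40, 40)]
faces = merge_detections(detections)
```

In OpenCV itself, the `minNeighbors` parameter of `CascadeClassifier.detectMultiScale` controls how many neighboring raw detections are required before a merged result is kept.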


non-face images. The number of weak classifiers per stage is
determined by a defined false positive rate, which has to be
achieved in each stage. Thus, only a few features are needed to
reach this rate in the first few stages, whereas the last stages
need a great many. Stages are added until the target overall false
positive rate is met [35].
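The effect of the stage structure can be illustrated with a short sketch. It assumes that a window is accepted only if every stage accepts it and that the stages’ false positive rates multiply, which holds only if the stages are treated as independent. The stage functions and the per-stage rate of 0.5 are hypothetical values chosen for the example.

```python
def cascade_accepts(window, stages):
    """Return True only if all stages accept; stop at the first rejection."""
    for stage in stages:
        if not stage(window):
            return False                       # rejected early, later stages are skipped
    return True


def overall_false_positive_rate(stage_rates):
    """Product of the per-stage false positive rates (independence assumed)."""
    rate = 1.0
    for stage_rate in stage_rates:
        rate *= stage_rate
    return rate


# Hypothetical stages that threshold a scalar "face score" of a window.
stages = [lambda score: score > 0.2, lambda score: score > 0.5, lambda score: score > 0.8]
accepted = cascade_accepts(0.9, stages)
rejected_early = cascade_accepts(0.1, stages)

# 20 stages that each let half of the non-faces pass.
overall = overall_false_positive_rate([0.5] * 20)
```

With these numbers the overall rate is 0.5 to the power of 20, i.e. below one in a million, although each individual stage is a very permissive classifier.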

Using local binary patterns, more sophisticated Haar-like 
features or additional non-frontal face detectors, higher detec- 
tion accuracies can be achieved [13, 28, 35]. 


Both the Haar cascade and the local binary pattern classifier are
implemented as cascaded classifiers in order to reject non-faces
quickly while still keeping a high accuracy for positive results
(see Figure 4).



Figure 4: Schematic description of the detection cascade, ac- 
cording to [35] 


B. Local Binary Pattern Classifier 

Local binary patterns (LBPs) describe local relationships between
neighboring pixels in a 3x3 neighborhood. Starting in the top left
corner and proceeding clockwise, the neighbors’ grayscale values are
compared to the center pixel’s value. If the value of the center
pixel at (n,m) is larger than the neighbor’s value, the result is 0,
otherwise 1. These eight binary values are concatenated and
interpreted as a grayscale value in the range 0...255. Formally,
this can be written as [16, 28]
A. Haar Cascade Classifier

The Haar cascade classifier is a fairly simple face detection
method and is therefore a very good basis for more complex
algorithms. Given a sufficiently large dataset, it can be trained
for many different objects, e.g. faces, cars or whole persons. In
order to classify images, Haar-like features (see Figure 5) are
used, which can be calculated extremely efficiently with integral
images. Thus, regional knowledge can be taken into account. As is
very common, the face detector introduced in [35] can only handle
grayscale images [6, 15, 35].



Figure 5: First two Haar-like features, according to [35] 

In pictures with a resolution of 24x24 pixels, more than 180,000
different Haar-like features can be found. Using a machine learning
(ML) algorithm, the 6,061 most important features can be chosen and
organized in a cascade structure. Training a chosen feature $f_j$
results in a threshold $\theta_j$ and a parity $p_j$. With the
feature value being the sum of the pixel values in the white blocks
minus the sum of the pixel values in the black blocks, the weak
classifier can be described as [35]


$$h_j(x) = \begin{cases} 1 & \text{if } p_j f_j(x) < p_j \theta_j \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$
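How a Haar-like feature is evaluated with an integral image and then thresholded by a weak classifier can be sketched as follows. The helper names, the left/right layout of the two-rectangle feature and the values chosen for the threshold and the parity are assumptions made for this example; trained values would come from the learning procedure in [35].

```python
import numpy as np


def integral_image(img):
    """Summed-area table with a leading row and column of zeros."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.astype(np.int64).cumsum(axis=0).cumsum(axis=1)
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of all pixels in the rectangle, using only four table lookups."""
    return int(ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x])


def two_rect_feature(ii, x, y, w, h):
    """White (left half) sum minus black (right half) sum; layout is hypothetical."""
    half = w // 2
    white = rect_sum(ii, x, y, half, h)
    black = rect_sum(ii, x + half, y, half, h)
    return white - black


def weak_classifier(feature_value, theta, parity):
    """Return 1 if parity * f < parity * theta, otherwise 0."""
    return 1 if parity * feature_value < parity * theta else 0


# Synthetic 24x24 window: bright left half, dark right half.
window = np.zeros((24, 24), dtype=np.uint8)
window[:, :12] = 200
window[:, 12:] = 50
ii = integral_image(window)
f = two_rect_feature(ii, 0, 0, 24, 24)
vote = weak_classifier(f, theta=1000, parity=-1)
```

The parity only flips the direction of the inequality, so the same feature can vote for windows that are brighter on either side.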


In the first five stages of the cascade, 1, 10, 25, 25 and 50
of these weak classifiers are utilized to distinguish between face and


Here, $s(x)$ is 0 if $x<0$ and 1 otherwise. The LBP is invariant to
monotonic grayscale transformations, e.g. changes in brightness or
contrast. LBPs with at most two transitions between 0 and 1 carry
the most information. Examples of such patterns can be seen in
Figure 6 [28].
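The LBP computation for a single 3x3 neighborhood can be sketched as follows. The bit order (first neighbor as least significant bit) and the function names are assumptions made for this example; other implementations may weight the neighbors differently, which changes the resulting code value but not the idea.

```python
def lbp_code(patch):
    """LBP code of a 3x3 grayscale patch, neighbors visited clockwise from top left."""
    center = int(patch[1][1])
    order = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (row, col) in enumerate(order):
        # s(x) = 0 if x < 0, 1 otherwise, with x = neighbor - center
        s = 0 if int(patch[row][col]) - center < 0 else 1
        code |= s << bit                       # assumption: first neighbor is the LSB
    return code


def transitions(code):
    """Number of 0/1 changes when reading the 8 bits as a closed ring."""
    bits = [(code >> i) & 1 for i in range(8)]
    return sum(bits[i] != bits[(i + 1) % 8] for i in range(8))


patch = [[5, 9, 1],
         [4, 6, 7],
         [2, 3, 8]]
code = lbp_code(patch)

# A monotonic grayscale transformation (here 2 * v + 10) leaves the code unchanged.
brighter = [[2 * v + 10 for v in row] for row in patch]
same_code = lbp_code(brighter)
```

The helper `transitions` counts the changes between 0 and 1, so patterns with at most two such changes can be selected with `transitions(code) <= 2`.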


Figure 6: LBPs for points (A, B), lines (C), edges (D) and
corners (E), according to [23]

Similar to the stage structure of a Haar cascade classifier, a
weak classifier is formed by using a gray value histogram. With
$H_n(x)$ being the classifier for the n-th stage and $h_{(n,m)}(x)$
being the histogram value for the entry $x$, the weak classifier
can be described as in (4). The LBP classifier is not only faster
than the Haar cascade classifier, but theoretically even more
precise [3, 28].

$$H_n(x) = \sum_{(n,m) \in W_n} h_{(n,m)}(x) \qquad (4)$$


IV. Face Recognition 

Even though face recognition is a much more challenging task than
face detection, today’s face recognition systems are very reliable,
at least under optimal conditions [2, 14]. Thus, many of today’s
applications use these methods to identify people in images, e.g.
Facebook’s Gallery or Apple iPhoto [32]. Classification problems
mostly occur due to variations in lighting, perspective or facial
expression [14, 32]. Furthermore, similar-looking individuals, e.g.
father and son or twins, can cause uncertainty when distinguishing
between them [15].

Different approaches can be used to achieve a higher face
recognition accuracy. In general, it is recommended to train the
model on large datasets with many variations in pose, age and
lighting conditions [3, 13, 32]. Another possibility to improve the
recognition performance is to use infrared lighting to avoid
shadows or other disturbances. Furthermore, additional features
that only appear under non-visible light, e.g. freckles and
pigmentation, can be used to recognize faces [15].

OpenCV implements face recognizers using principal component
analysis (Eigenfaces), linear discriminant analysis (Fisherfaces)
and local binary pattern histograms [23]. With any of these
methods, the face recognizer has to be trained with custom face
images in order to differentiate between individuals. The
classification is done by comparing the images’ features in a
high-dimensional feature space with a K-nearest neighbor algorithm
[6].

A. Eigenface Classifier 

The Eigenface classifier uses a principal component analysis
(PCA) to reduce the dimensionality of the images. Using a PCA, the
E eigenvectors with the highest eigenvalues can be selected to
describe the given dataset. These eigenvectors span a comparatively
low-dimensional face space, into which every image can be
projected. Because the PCA’s eigenvectors, after reshaping them
into an image format, look very much like faces (see Figure 7),
they are called Eigenfaces [4, 34].



Figure 7: First 20 Eigenfaces of the AT&T face dataset, ac- 
cording to [23] 

Using the extracted Eigenfaces, unknown faces can be
reconstructed (see Figure 8). This gives a good intuition for how
many principal components are necessary to distinguish between
individuals. Usually 40 to 80 components are sufficient but,
depending on the dataset, sometimes up to 300 Eigenfaces should be
used [13, 23, 36]. Figure 8 shows that the original face becomes
recognizable starting at 20 Eigenfaces.


In order to calculate the eigenvectors of a dataset
$(\Gamma_1, \Gamma_2, \Gamma_3, \dots, \Gamma_R)$ efficiently, its
vectorized mean image $\Psi$ and the differences
$\Phi_i = \Gamma_i - \Psi$ are needed. Given a face matrix
$A = [\Phi_1\ \Phi_2\ \dots\ \Phi_R]$, the covariance matrix of all
face images can be calculated as [34]

$$C = \frac{1}{R} \sum_{r=1}^{R} \Phi_r \Phi_r^T = A A^T \qquad (5)$$

This matrix $C$ has a dimension of $N^2 \times N^2$ (for face
images with a resolution of $N \times N$), which means that $N^2$
eigenvectors $u_i$ have to be determined. This requires high
computational resources and is therefore unsuitable for real-time
applications. For the common case that $R < N^2$, a workaround is
to build the $R \times R$ matrix $L = A^T A$, for which only $R$
eigenvectors $v_i$ have to be calculated. Using them, the
originally desired eigenvectors $u_i$ can be determined as [19, 34]

$$u_i = \sum_{r=1}^{R} v_{ir}\, \Phi_r, \qquad i = 1, \dots, R \qquad (6)$$



Figure 8: Reconstruction of a face using Eigenfaces, according 
to [23]. Number of utilized Eigenfaces (top left to bottom 
right): 1, 2, 3, 4, 5, 10, 20, 30, 40, 50, 60, 80, 90, 100, 125, 
150, 175, 200, 300, original face 

Every new facial image $\Gamma$ can be decomposed into $E$ Eigenface
components as seen in (7). The resulting weights form a weight
vector $\Omega = (\omega_1, \omega_2, \dots, \omega_E)^T$ [34].

$$\omega_k = u_k^T (\Gamma - \Psi) \qquad (7)$$

Applying a Euclidean distance measure and the K-nearest
neighbor method, faces can be classified [19, 34].
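The training and classification steps of this section can be sketched with NumPy as follows. This is a minimal illustration and not OpenCV’s implementation: the function names are hypothetical, a 1-nearest-neighbor rule stands in for the K-nearest neighbor method, the Eigenfaces are additionally normalized to unit length, and the synthetic 16-pixel "faces" only serve to make the example self-contained.

```python
import numpy as np


def train_eigenfaces(images, num_components):
    """images: array of shape (R, N*N) with one vectorized face per row.

    num_components should be smaller than R, because the centered data
    has at most R - 1 non-zero eigenvalues.
    """
    gamma = np.asarray(images, dtype=np.float64)
    psi = gamma.mean(axis=0)                   # mean image
    A = (gamma - psi).T                        # columns are the differences Phi_r
    L = A.T @ A                                # small R x R matrix instead of A A^T
    eigvals, V = np.linalg.eigh(L)             # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:num_components]
    U = A @ V[:, order]                        # u_i = sum_r v_ir * Phi_r
    U /= np.linalg.norm(U, axis=0)             # unit-length Eigenfaces (assumption)
    return psi, U


def project(face, psi, U):
    """Weights omega_k = u_k^T (Gamma - Psi) for all selected Eigenfaces."""
    return U.T @ (np.asarray(face, dtype=np.float64) - psi)


def classify(face, psi, U, train_weights, labels):
    """1-nearest neighbor in face space using the Euclidean distance."""
    w = project(face, psi, U)
    dists = np.linalg.norm(train_weights - w, axis=1)
    return labels[int(np.argmin(dists))]


# Synthetic data: two "persons", three noisy 4x4 images each.
rng = np.random.default_rng(0)
base_a, base_b = rng.random(16), rng.random(16)
faces = np.array([base_a + 0.01 * rng.standard_normal(16) for _ in range(3)]
                 + [base_b + 0.01 * rng.standard_normal(16) for _ in range(3)])
labels = ["a", "a", "a", "b", "b", "b"]

psi, U = train_eigenfaces(faces, num_components=3)
train_weights = (faces - psi) @ U              # one weight vector per training face
predicted = classify(base_b + 0.01, psi, U, train_weights, labels)
```

Solving the eigenproblem for the small matrix `L` keeps the cost tied to the number of training images rather than to the number of pixels.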

B. Fisherface Classifier

Dimensionality reduction by linear discriminant analysis (LDA)
can counter the PCA’s disadvantage of not considering any class
dependencies while projecting images into a feature space. Using
Fisher’s linear discriminant analysis (FLD), the classes stay
linearly separable, which makes classification easier and more
reliable, especially for changes in lighting conditions. Thus, E
orthonormal vectors describe a matrix $W$ that maximizes the
between-class scatter (see (8)) while minimizing the within-class
scatter (see (9)). For both, the matrices $S_B$ and $S_W$ have to
be defined


classifier, the Haar cascade classifier is a little slower, but still
capable of consistently classifying pictures into faces and
non-faces. In order to detect the left and right eye in an image, the
corresponding Haar cascade eye detectors are utilized.
where $C$ is the number of different classes, $N_i$ the number of
training images of class $X_i$ and $\Psi_i$ is its mean image. For
face recognition tasks, the LDA projection $W_{opt}$ can be written
as in (10). Using for example a PCA, the dimension reduces to $N-C$,
while the FLD reduces it further to $C-1$ [4].

$$W_{opt}^T = W_{fld}^T\, W_{pca}^T \qquad (10)$$

with 
The Fisherface method handles background and lighting variations
better than Eigenfaces do [13]. Furthermore, Fisherfaces are much
more reliable when using a small training set or when faces differ
heavily from the training data, e.g. due to glasses or facial
expressions [4, 23].
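For reference, the quantities used in this section are commonly written as follows in the standard Fisherface formulation. The notation is adapted to the symbols used above ($C$ classes, $N_i$ images in class $X_i$, class mean image $\Psi_i$); the global mean $\Psi$, the sample index $k$ and the total scatter $S_T$ are assumptions introduced here, so this is a sketch of the usual definitions rather than a reproduction of the original equations.

```latex
\begin{align*}
S_B &= \sum_{i=1}^{C} N_i \, (\Psi_i - \Psi)(\Psi_i - \Psi)^T \\
S_W &= \sum_{i=1}^{C} \sum_{\Gamma_k \in X_i} (\Gamma_k - \Psi_i)(\Gamma_k - \Psi_i)^T \\
W_{pca} &= \arg\max_{W} \left| W^T S_T W \right|, \qquad S_T = S_B + S_W \\
W_{fld} &= \arg\max_{W}
  \frac{\left| W^T W_{pca}^T S_B W_{pca} W \right|}
       {\left| W^T W_{pca}^T S_W W_{pca} W \right|}
\end{align*}
```

The PCA step is needed because $S_W$ is singular whenever the number of training images is smaller than the number of pixels; projecting onto $N-C$ dimensions first makes the within-class scatter invertible.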

C. Local Binary Pattern Histogram Classifier 

Classification using local binary pattern histograms (LBPH) is
quite similar to face detection with LBPs. The differences are that
a K-nearest neighbor method is used to identify individuals and
that the LBP operator can be extended to obtain more reliable
results. Multiple approaches are possible for this. Instead of
considering the eight direct neighbors, P neighboring pixels on a
circle of radius R can be used for the generalized $LBP_{P,R}$
operator [1, 15, 23]. Another option is to use a multi-block LBP,
which compares the average grayscale values of neighboring pixel
blocks with the average of a centered region [15].

The LBPH classifier’s main disadvantage is that it is quite 
slow and therefore unsuitable for fluent video playback in real- 
time situations [1]. Thus, as described in the following chapter, 
it cannot be used in the implemented application. 

V. Application of a Face Recognition System 

In order to compensate for changes in lighting, face rotation,
background and hairstyle, some preprocessing steps are taken before
recognizing faces (see Figure 9). It can be assumed that this allows
the classifiers to be applied not only in constrained environments,
but also in unconstrained ones [34].
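Figure 9 lists histogram equalization among the preprocessing steps. To illustrate what this step does to the lighting of a face image, the following NumPy sketch implements a plain global histogram equalization for 8-bit grayscale images. The function name and the tiny test image are assumptions made for this example; in an OpenCV-based application, `cv2.equalizeHist` would typically be used instead.

```python
import numpy as np


def equalize_histogram(gray):
    """Global histogram equalization of an 8-bit grayscale image."""
    gray = np.asarray(gray, dtype=np.uint8)
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()                        # cumulative pixel count per gray level
    cdf_min = cdf[cdf > 0][0]                  # count of the darkest occurring level
    total = gray.size
    if total == cdf_min:
        return gray.copy()                     # constant image, nothing to stretch
    # Map each gray level so that the cumulative distribution becomes linear.
    lut = np.round((cdf - cdf_min) / (total - cdf_min) * 255.0)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[gray]


# A low-contrast image that only uses the gray levels 100 to 120.
dim = np.array([[100, 100], [110, 120]], dtype=np.uint8)
stretched = equalize_histogram(dim)
```

After equalization the image uses the full range from 0 to 255, which reduces the influence of overall brightness on the subsequent recognition step.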

The Haar cascade classifier is chosen as a suitable face detection
algorithm. Even though OpenCV’s LBP classifier is approximately
61 % faster in practice, it is less accurate and unable to detect
faces reliably in an artificially lit office room. In comparison to
the LBP




[Figure 9 panels: (C) resize image to lower resolution; (E) face
detection; (F, G) cut out approximate eye region and histogram
equalization; (H) eye detection]



Figure 9: Preprocessing steps for face recognition, according 



Figure 10: Recognition accuracy vs. number of training faces 

To find the best face recognizer for the described application,
several tests are carried out. In particular, they evaluate the
accuracy, training duration, model size and recognition speed over
the number of components (Eigenfaces/Fisherfaces) and the number of
classes in the training set, as well as the total recognition
accuracy over the number of training images (see Figure 10).


It can be seen that, after applying the preprocessing steps, the
Eigenface method consistently outperforms the Fisherface method.
For three or more training faces, the LBPH is slightly more
accurate than the Eigenface method. The test also shows that, for
relatively small numbers of components, the training duration of
Eigenfaces is slightly lower than that of the Fisherface method,
while this reverses for larger numbers of components. The training
duration of LBPH is by far the longest, especially when using
larger radii and more neighbors. This is also reflected in the
model sizes. LBPH models are a lot larger than PCA and LDA models,
which are of the same order of magnitude, with the LDA models being
the smaller ones. Once the number of components reaches the number
of classes, the Fisherfaces’ recognition speed is slightly higher
than the Eigenfaces’. Before that point, both are almost identical
and about five times higher than the LBPH’s recognition speed. As
seen in Figure 11, the Fisherfaces’ recognition accuracy reaches
its maximum approximately where the number of components equals the
number of classes. This can be explained by the number of classes
in the dataset. As described in section IV.B, the number of
components is limited to C-1, which means that adding further
components does not increase the recognition accuracy. Similarly,
the Eigenfaces’ recognition accuracy reaches its maximum at R, the
number of images in the training set (see section IV.A). With five
classes of ten images each, two of which are held out for testing
in every class, this limit is reached at approximately
5 x (10 - 2) = 40 components.



Figure 11: Recognition accuracy vs. number of components

These performance measures suggest using an Eigenface classifier
because of its higher recognition accuracy and otherwise similar
properties. Further tests have shown that approximately 30
components yield the best recognition results. This matches the
suggestions in the literature [13, 23, 36]. In [4] it is
recommended not to use the first three Eigenfaces in order to
achieve even higher accuracies. Unfortunately, this option is not
supported by OpenCV [24].

In the final application, images for a face dataset have been 
collected. There, four individuals with approximately 1400 
images are considered. The system implemented in OpenCV 
runs in real-time, providing approximately 15-18 frames per 
second. 


VI. Fusion of the Localization Results for Speaker
Identification

A fusion of the localization results of sound source locali- 
zation and face recognition is able to enhance the reliability of 
a speaker detection system and enables it to identify the speak- 
er. This can be used in smart rooms, improved speaker tracking 
for videoconferencing or applications for ambient assisted liv- 
ing [7, 8, 29]. Furthermore, an extension towards gesture 
recognition for human machine interaction is possible [7, 10]. 
Therefore, a speaker identification algorithm is developed and 
introduced in this contribution. 

In order to track and identify speakers, reliable sound source
localization and face recognition are necessary. The sound source,
and therefore the potential speaker, is located by finding the
color map’s maximum. The face recognition provides a specific
localization as well, but some uncertainties remain that have to be
eliminated. These can be false positive detections and
recognitions or wrongly classified individuals. To overcome these
problems, three consecutively recognized faces are compared. If all
of them are classified as the same person, the result is shown at
the face’s new position. Another possibility is that the recognized
face matches one of the two previously identified individuals and
that this earlier result lies in approximately the same area as the
current detection. This also results in a confirmed recognition at
the currently classified face’s position. If neither option
applies, no recognition result is displayed and the detection is
ignored as if there had never been a face in the image. Using both
the speaker and the face localization results, the overall
localization and identification can be achieved as seen in Figure
12. The outcome distinguishes between the following cases: no
result, identified speaker, face only, unknown speaker or
loudspeaker, and loudspeaker and known face at two different
positions. Whenever possible, the face’s location is chosen as the
localization position, because optical tracking algorithms are
known to have a better spatial resolution than acoustic
localization techniques [8].
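The consistency check and the final decision can be sketched in plain Python. The exact branching of Figure 12 is not reproduced here: the class and function names, the distance threshold of 50 pixels and the mapping of the five outcomes to the presence and proximity of a sound source and a confirmed face are assumptions made for this illustration.

```python
import math
from collections import deque


def _distance(a, b):
    """Euclidean distance between two (x, y) image positions."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


class RecognitionFilter:
    """Confirms a recognition only if it is consistent with the two previous results."""

    def __init__(self, max_dist=50.0):
        self.max_dist = max_dist               # hypothetical "same area" radius in pixels
        self.previous = deque(maxlen=2)        # two previous (label, (x, y)) results

    def update(self, label, position):
        previous = list(self.previous)
        self.previous.append((label, position))
        if len(previous) < 2:
            return None
        # Case 1: all three results name the same person.
        if all(prev_label == label for prev_label, _ in previous):
            return label, position
        # Case 2: one previous result names the same person in a nearby area.
        for prev_label, prev_pos in previous:
            if prev_label == label and _distance(prev_pos, position) <= self.max_dist:
                return label, position
        return None                            # treated as if no face had been seen


def fuse(sound_pos, face, max_dist=50.0):
    """Combine a sound source position with a confirmed face (label, position)."""
    if sound_pos is None and face is None:
        return "no result", None
    if sound_pos is None:
        return "face only", face[1]
    if face is None:
        return "unknown speaker or loudspeaker", sound_pos
    if _distance(sound_pos, face[1]) <= max_dist:
        return "identified speaker", face[1]   # the face position is preferred
    return "loudspeaker and known face at two different positions", sound_pos


# Usage with hypothetical labels and pixel positions.
tracker = RecognitionFilter(max_dist=30.0)
first = tracker.update("anna", (100, 100))
second = tracker.update("anna", (102, 101))
third = tracker.update("anna", (104, 100))
outcome = fuse((110, 105), third, max_dist=30.0)
```

Returning the face position for an identified speaker follows the argument above that the optical localization has the better spatial resolution.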



Figure 12: Decision tree for localization result 


Figure 13 shows a possible output image of the implement- 
ed speaker identification system. There, a speaker’s face and 
sound source location are detected and merged for a more pre- 
cise localization and identification of the sound source. The red 
circle marks the speaker’s approximate position. 



Figure 13: Result of the speaker identification system 


VII. Conclusion 

This contribution briefly explains the basics of an acoustic
camera and shows why it makes sense to use a double ring array with
an odd number of microphones. Additionally, it gives an overview of
the implemented sound source localization methods. It is shown that
a combination of the SRP and SRP-PHAT algorithms is desirable for
speech localization. Furthermore, this contribution gives an
introduction to face detection and recognition methods. It is shown
that Haar cascade classifiers outperform local binary pattern
classifiers when detecting faces. Similarly, it is pointed out that
Eigenfaces should be preferred to Fisherfaces and local binary
pattern histograms for a face recognition system. Finally, an
algorithm for the fusion of localization results is introduced. It
combines sound source localization and face detection to identify
speakers reliably.